Mining Programming Language Vocabularies from Source Code
نویسندگان
چکیده
We can learn much from the artifacts produced as the by-products of software development and stored in software repositories. Of all such potential data sources, one of the most important from the perspective of program comprehension is the source code itself. While other data sources give insight into what developers intend a program to do, the source code is the most accurate human-accessible description of what it will do. However, the ability of an individual developer to comprehend a particular source file depends directly on his or her familiarity with the specific features of the programming language being used in the file. This is not unlike the difficulties second-language learners may encounter when attempting to read a text written in a new language. We propose that by applying the techniques used by corpus linguists in the study of natural language texts to a corpus of programming language texts (i.e., source code repositories), we can gain new insights into the communication medium that is programming language. In this paper we lay the foundation for applying corpus linguistic methods to programming language by 1) defining the term “word” for programming language, 2) developing data collection tools and a data storage schema for the Java programming language, and 3) presenting an initial analysis of an example linguistic corpus based on version 1.5 of the Java Developers Kit.
منابع مشابه
A Lightweight Approach to Uncover Technical Information in Unstructured Data
Developer communication through email, chat, or issue report comments consists mostly of largely unstructured data, i.e., natural language text, mixed with technical information such as project-specific jargon, abbreviations, source code patches, stack traces and identifiers. These technical artifacts represent a valuable source of knowledge on the technical part of the system, with a wide rang...
متن کاملXML and the art of code maintenance
We present three distinct XML vocabularies and demonstrate integration with o:XML source code. The vocabularies relate to interface documentation, unit tests and Design By Contract conditions. By layering the information with XML namespaces, syntax conflicts are avoided and selective processing can be performed using standard XML tools. Automated unit tests and the implementation of Design By C...
متن کاملDeclarative Visitors to Ease Fine-grained Source Code Mining with Full History on Billions of AST Nodes by Robert Dyer, Hridesh Rajan, and Tien N. Nguyen
Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for detecting patterns in software development, testing hypotheses for new software engineering approaches, etc. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at...
متن کاملMedlineR: an open source library in R for Medline literature data mining
SUMMARY We describe an open source library written in the R programming language for Medline literature data mining. This MedlineR library includes programs to query Medline through the NCBI PubMed database; to construct the co-occurrence matrix; and to visualize the network topology of query terms. The open source nature of this library allows users to extend it freely in the statistical progr...
متن کاملTemplate Mining in Source-code Digital Libraries
As a greater number of software developers make their source code available, there is a need to store such opensource applications into a repository, and facilitate search over the repository. The objective of this research is to build a digital library of Java source code, to enable search and selection of source code. We believe that such a digital library will enable better sharing of experi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009